The biological activity and functional specificity of proteins depend on their native three-dimensional structures, which are determined by inter- and intra-molecular interactions. In this paper, we investigate the geometrical factor of protein conformation as a consequence of energy minimization in protein folding. Folding simulations of 10 polypeptides with chain lengths ranging from 183 to 548 residues show that the dimensionless ratio V/A⟨r⟩ of the van der Waals volume V to the product of the surface area A and the average atomic radius ⟨r⟩ of the folded structures, calculated with the atomic radii settings used in SMMP [Eisenmenger F. et al., Comput. Phys. Commun. 138 (2001) 192], quickly approaches 0.49 during the course of energy minimization. A large-scale analysis of protein structures shows that the ratio for real and well-designed proteins is universal and equal to 0.491 ± 0.005. The fractional composition of hydrophobic and hydrophilic residues does not affect the ratio substantially. The ratio also holds for intrinsically disordered proteins, while it ceases to be universal for polypeptides with bad folding properties.



PACS numbers: 87.14.E-, 87.15.A-, 87.15.-v 



I. INTRODUCTION 

In recent decades, physical methods have been widely used to study the properties and structures of biopolymers [1]-[3], including DNA [4, 5], RNA [6], and proteins [7]-[10]. Proteins assume specific conformations, determined by their chemical compositions or sequences, in order to develop biological activity and functional specificity. The corresponding three-dimensional (3D) structures are a consequence of inter- and intra-molecular interactions, in which energy minimization is the principle governing the folding tendency. In spite of the various components involved in the interactions, there have been attempts to derive simple geometric factors from a variety of conformations, which can either be considered as a criterion for structure validity or be used as an effective constraint in folding simulations.

Geometric properties of protein molecules have been studied for more than three decades [11]-[13]. Among others, the Ramachandran plot [14] is a practical criterion widely used for improving the quality of NMR or crystallographic protein structures. In a polypeptide, the main-chain N-Cα and Cα-C bonds are relatively free to rotate, and the rotations can be represented by two torsion angles. Because of steric hindrance, these angles can appear only in certain combinations, which define the allowed regions of the torsion angles for secondary structures in the plot.

Furthermore, it has been found that the mean volume of an amino acid in the interior of proteins is very close to that of the amino acid in crystals [11, 12]. With the help of the Delaunay triangulation method, Liang and Dill [15] reported that protein packing is heterogeneous and that, in terms of packing density, protein molecules may be either well packed or loosely packed. Zhang et al. [16] showed that the packing density of single-domain proteins decreases with chain length, a generic feature shared with random polymers that satisfy a loose constraint on compactness.

Besides the Ramachandran plot and the packing density, which are conclusions based on observations, theoretical models have been introduced to simulate the properties of protein geometric structures. For example, Banavar and Maritan [17] introduced the effective backbone tube model to analyze the secondary structures of proteins under the constraint of minimum energy and showed that the tube has an effective radius of 2.7 Å.

When a polypeptide folds, hydrophobic effects cause nonpolar side chains to cluster together in the protein interior or at interfaces, whereas polar side chains tend to maximize their contacts with the surrounding solvent molecules. The stability of the system is partially due to the burial of the nonpolar residues and can be measured by the loss of solvent accessible surface area [18]-[21]. An atom or group of atoms is defined as accessible if a solvent molecule of specified size, generally water, can be brought into van der Waals contact with it. The solvent accessible surface is then defined as the surface traced out by the center of a probe sphere, which represents the solvent molecule, as it rolls over the van der Waals surface of the protein [20, 22]. Hence, volume and surface area are suitable parameters to characterize the geometrical conformation of a protein.



II. METHODS 



A. Folding simulation 



We used the SMMP package [23, 24] for protein folding simulations, employing simulated annealing as well as the canonical Monte Carlo method to generate folded structures. Starting with a polypeptide in a solvent, SMMP searches for the lowest-energy conformation by utilizing the energy function

\[
\begin{aligned}
E_{\mathrm{tot}} &= E_{\mathrm{LJ}} + E_{\mathrm{el}} + E_{\mathrm{hb}} + E_{\mathrm{tors}},\\
E_{\mathrm{LJ}} &= \sum_{j>i}\left(\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right),\\
E_{\mathrm{el}} &= 332\sum_{j>i}\frac{q_i q_j}{\varepsilon r_{ij}},\\
E_{\mathrm{hb}} &= \sum_{j>i}\left(\frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}}\right),\\
E_{\mathrm{tors}} &= \sum_{n} U_n\left(1 \pm \cos(k_n\phi_n)\right).
\end{aligned}
\tag{1}
\]

Here r_ij is the distance in Å between atoms i and j. A_ij, B_ij, C_ij, and D_ij are parameters of the empirical potentials, q_i and q_j are the partial charges on atoms i and j, respectively, and ε = 2 is the dielectric constant of the protein interior. The factor 332 in Eq. (1) is used to express the energy in kcal/mol. U_n is the energetic torsion barrier of rotation about bond n, and k_n is the multiplicity of the torsion angle φ_n [17]. The input file for SMMP is a sequence of amino acids, and the output file is in the Protein Data Bank (PDB) format [25]. The protein-solvent interactions were implemented with implicit water solvation by selecting the type 1 solvent in the SMMP main.f program. All parameters needed for the simulation are contained in the SMMP package.
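
To make the individual contributions to the energy function concrete, the following minimal Python sketch evaluates one Lennard-Jones, electrostatic, hydrogen-bond, and torsion term in the usual ECEPP form. It is not SMMP code: the function names are hypothetical, the choice of sign in the torsion term is left to the caller, and only the factor 332 and ε = 2 are taken from the text.

```python
import math

COULOMB_FACTOR = 332.0  # converts e^2/Angstrom to kcal/mol (value quoted in the text)
EPSILON = 2.0           # dielectric constant of the protein interior (from the text)


def lennard_jones(a_ij, b_ij, r_ij):
    """12-6 term A/r^12 - B/r^6 for one atom pair (r in Angstrom)."""
    return a_ij / r_ij**12 - b_ij / r_ij**6


def electrostatic(q_i, q_j, r_ij, epsilon=EPSILON):
    """Coulomb term 332 * q_i * q_j / (epsilon * r) in kcal/mol."""
    return COULOMB_FACTOR * q_i * q_j / (epsilon * r_ij)


def hydrogen_bond(c_ij, d_ij, r_ij):
    """12-10 term C/r^12 - D/r^10 for one donor-acceptor pair."""
    return c_ij / r_ij**12 - d_ij / r_ij**10


def torsion(u_n, k_n, phi_n, sign=+1):
    """Torsion term U_n * (1 +/- cos(k_n * phi_n)), with phi_n in radians."""
    return u_n * (1.0 + sign * math.cos(k_n * phi_n))
```

The total energy of a conformation would be the sum of such terms over all atom pairs and all rotatable bonds, with the A_ij, B_ij, C_ij, D_ij, U_n, and k_n values supplied by the force field.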



B. Calculation of volume and surface area 

To compute the volume V and surface area A of the polypeptide in the course of a folding simulation, we used the ARVO package [26], which is based on analytic equations [22]. ARVO can calculate V and A of a system of N atoms that can overlap in any way. The main idea of the ARVO algorithms is to express the volume and surface area of overlapping spheres as surface integrals of the second kind over closed regions. Using stereographic projection, one can transform the surface integrals into a sum of double integrals, which are then reduced to curve integrals [22, 26]. It has been shown that the van der Waals surface areas [27] computed by the GETAREA module in the FANTOM package [28, 29] and by ARVO are consistent [26]. In comparison
with programs implementing different algorithms and approximations to describe the geometrical properties of atomic groups, the differences among the surface areas computed by VOLBL [30], GEPOL [31], and ARVO [26] are less than 1%, and the differences among the computed volumes are about 2% (see Refs. [26] and [32] for detailed discussions). Because ARVO is based on an analytical method, its computation of the volume and surface area of protein molecules is more accurate than numerical integration, which always contains numerical errors [33].
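
ARVO's general algorithm is beyond a short example, but the quantities it computes can be illustrated for the simplest overlapping system, two spheres, where closed-form spherical-cap formulas exist. The sketch below covers only this special case and is not the stereographic-projection method of ARVO; the function name is hypothetical.

```python
import math


def union_of_two_spheres(r1, r2, d):
    """Return (volume, area) of the union of two spheres.

    r1, r2: radii; d: distance between the centers (same length unit).
    """
    v_sum = 4.0 / 3.0 * math.pi * (r1**3 + r2**3)
    a_sum = 4.0 * math.pi * (r1**2 + r2**2)
    if d >= r1 + r2:  # disjoint or touching: nothing is shared
        return v_sum, a_sum
    if d <= abs(r1 - r2):  # one sphere lies entirely inside the other
        r = max(r1, r2)
        return 4.0 / 3.0 * math.pi * r**3, 4.0 * math.pi * r**2
    # Heights of the caps of sphere 1 and sphere 2 that lie inside the other sphere
    h1 = (r2 - r1 + d) * (r2 + r1 - d) / (2.0 * d)
    h2 = (r1 - r2 + d) * (r1 + r2 - d) / (2.0 * d)
    # Volume of the lens-shaped intersection (sum of the two caps)
    lens = math.pi * (h1**2 * (3.0 * r1 - h1) + h2**2 * (3.0 * r2 - h2)) / 3.0
    # Surface area of the two caps, which is buried inside the union
    buried = 2.0 * math.pi * (r1 * h1 + r2 * h2)
    return v_sum - lens, a_sum - buried
```

For two bonded atoms the overlap removes both volume and exposed area, which is why V and A of a molecule cannot be obtained by simply summing over isolated atomic spheres.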

The input file for ARVO contains the coordinates (x_i, y_i, z_i) of the center and the radius r_i of each of the N atoms in the system, where 1 ≤ i ≤ N. The atoms can overlap in any way. To calculate the van der Waals surface area A and volume V of a PDB protein structure, we used the coordinates of the carbon (C), nitrogen (N), oxygen (O), and sulfur (S) atoms in the PDB data and the van der Waals radii of C, N, O, and S as input data. According to the conventional parameter settings in protein folding simulations [23, 24, 34-37], the N atom has a (van der Waals) radius of 1.55 Å, the C atom 1.55 Å, the S atom 2.00 Å, and the O atom 1.40 Å. The relatively small hydrogen (H) atom is neglected; a water (solvent) molecule is represented by an O atom with radius 1.40 Å. The radii of these atoms are determined at the atomic level by the densities of the electron cloud, and they are consistent with the other physical quantities used in the SMMP simulation [23, 24]. Further, to calculate the solvent accessible surface area A_s and the related volume V_s of a protein structure, we added the solvent radius of 1.40 Å to the van der Waals radii of C, N, O, and S; i.e., the effective radii of C, N, O, and S are 2.95 Å, 2.95 Å, 2.80 Å, and 3.40 Å, respectively. The average atomic radius ⟨r⟩ and the average effective radius ⟨r_s⟩ of folded structures are calculated using these radii.
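
As an illustration of how the atom list and the average radii described above could be assembled, here is a minimal Python sketch. It reads fixed-column PDB ATOM records (coordinates in columns 31-54, element symbol in columns 77-78), keeps only C, N, O, and S, and attaches the radii quoted in the text. The function names, the restriction to ATOM records, and the in-memory tuple format are assumptions; the actual ARVO input file layout is not reproduced here.

```python
VDW_RADII = {"C": 1.55, "N": 1.55, "O": 1.40, "S": 2.00}  # Angstrom, from the text
PROBE_RADIUS = 1.40  # Angstrom, water represented by an O atom


def read_heavy_atoms(pdb_lines, probe=0.0):
    """Return a list of (x, y, z, r) for C, N, O, S atoms in ATOM records.

    probe=0.0 gives van der Waals radii; probe=PROBE_RADIUS gives the
    effective radii used for the solvent accessible surface.
    """
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        element = line[76:78].strip().upper()
        if element not in VDW_RADII:  # skips hydrogen and any other element
            continue
        x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
        atoms.append((x, y, z, VDW_RADII[element] + probe))
    return atoms


def mean_radius(atoms):
    """Average radius over the atom list, i.e. <r> or <r_s>."""
    return sum(atom[3] for atom in atoms) / len(atoms)
```

Calling `read_heavy_atoms(lines)` and `read_heavy_atoms(lines, probe=PROBE_RADIUS)` on the same file yields the two atom lists needed for (V, A, ⟨r⟩) and (V_s, A_s, ⟨r_s⟩), respectively.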



III. RESULTS AND DISCUSSIONS 

A. V/A⟨r⟩ ratio for the van der Waals volume V, surface area A, and average atomic radius ⟨r⟩

The ratio R = V/A⟨r⟩ and the total energy were computed over the time course of each simulation. In all of our simulations, the final energy obtained with the canonical Monte Carlo method is lower than that obtained with simulated annealing. The results for 10 small proteins (with 183 ≤ N ≤ 548; Table I) reveal that R approaches ≈ 0.49 as the energy decreases, although the resultant structures are not necessarily close to the native structures. This suggests that the energy minimization criterion is connected with the geometric conformation characterized by the ratio R ≈ 0.49.
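
The trend reported in Table I can be checked directly from the tabulated values. The short Python sketch below is only a bookkeeping aid and not part of the original analysis; it averages each ratio column of Table I (the helper name is hypothetical).

```python
from statistics import mean

# Rows of Table I: (PDB code, N, R', R''_a, R''_c, R)
TABLE_I = [
    ("1HP9", 183, 0.5694, 0.4912, 0.4875, 0.5047),
    ("1KDL", 193, 0.5364, 0.4867, 0.4863, 0.4998),
    ("1GCN", 246, 0.5480, 0.4917, 0.4875, 0.4946),
    ("1VII", 295, 0.5560, 0.4876, 0.4878, 0.5058),
    ("2PLH", 330, 0.6280, 0.4864, 0.4857, 0.4954),
    ("2OVO", 418, 0.5549, 0.4981, 0.4891, 0.4933),
    ("1PGB", 436, 0.5385, 0.4866, 0.4878, 0.4891),
    ("1HPT", 440, 0.5726, 0.4879, 0.4871, 0.4928),
    ("1UOY", 452, 0.5414, 0.4896, 0.4872, 0.4945),
    ("1UTG", 548, 0.5785, 0.4780, 0.4858, 0.4896),
]


def column_means(rows):
    """Mean of R', R''_a, R''_c, and R over all rows."""
    return tuple(mean(row[i] for row in rows) for i in range(2, 6))


r_init, r_anneal, r_canon, r_pdb = column_means(TABLE_I)
print(f"R'={r_init:.4f}  R''_a={r_anneal:.4f}  R''_c={r_canon:.4f}  R={r_pdb:.4f}")
```

The averages drop from about 0.56 for the random initial configurations to about 0.49 after either minimization protocol, in line with the statement above.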

To confirm that the ratio R ≈ 0.49 is relevant, we tested 743 PDB structures from the Protein Sequence Culling Server [38], from which only high-resolution X-ray data were selected. The ratio is found to be R = 0.491 ± 0.005. To determine a reasonable tolerance for the ratio, we also tested a larger data set from the Protein Data Bank. In total, 31059 PDB entries deposited at the Protein Data Bank in June 2005 were downloaded for the test. After excluding non-proteins, such as DNA and RNA, and problematic structures in which only α carbons are included, 28664 protein structures remained for the statistics. In our analysis, both X-ray and NMR data have been used. For NMR data consisting of more than one model, we selected the first model, which is considered the most accurate one or is an average of the models. We plotted the dependence of the van der Waals surface area A and volume V on the total number N of (C, N, O, and S) atoms in Fig. 1(a), which shows that A and V increase linearly with N. A linear correlation between the volume V and the number of atoms N or the area A was found by Lorenz et al. [39] in Monte Carlo studies of a model of clusters of random uncorrelated spheres. Similar linear relations have also been discussed by Liang and Dill [15] for 636 protein structures. The result in Fig. 1(a) provides a more solid demonstration on the basis of a larger database. Furthermore, we plotted the distribution of R = V/A⟨r⟩ as a histogram in blue (solid line) in Fig. 1(b); it lies in a very narrow interval centered at R = 0.4910.

FIG. 1. (a) Dependence of V and A on N for 28664 PDB protein structures. The numbers indicate the slopes. (b) The distribution P(R) (blue, solid line) for the structures shown in (a). The Gaussian fit has its maximum at R ≈ 0.491 and the form

\[
y = y_0 + \frac{S}{w\sqrt{\pi/2}} \exp\left[-\frac{2(x - x_c)^2}{w^2}\right],
\]

with y_0 = 0.0053, x_c = 0.4910, w = 0.0046, and S = 0.0008; the adjusted R² of the fit is 0.984. The histogram of R for 932 artificial extended structures is shown in red (dotted line). The green (dash-dotted) line is the corresponding Gaussian fit, with its maximum at R ≈ 0.5120. (c) A typical compact PDB structure (PDB code: 1VII). (d) An extended structure obtained with Swiss-PdbViewer 3.7 (SP5). (e) The distribution of R_s for 28236 proteins using a probe sphere of radius 1.4 Å. The maximum of the Gaussian fit is located at R_s ≈ 1.2402.

TABLE I. The V/A⟨r⟩ ratio for 10 typical protein structures. R' is the V/A⟨r⟩ ratio of a structure with a randomly chosen configuration generated by the SMMP package [23, 24]; R'' is the ratio of the final structure after the folding simulation, where the subscript "a" stands for simulated annealing and "c" for canonical Monte Carlo; and R is the ratio of the structure from the PDB.

PDB     N     R'       R''_a    R''_c    R
1HP9    183   0.5694   0.4912   0.4875   0.5047
1KDL    193   0.5364   0.4867   0.4863   0.4998
1GCN    246   0.5480   0.4917   0.4875   0.4946
1VII    295   0.5560   0.4876   0.4878   0.5058
2PLH    330   0.6280   0.4864   0.4857   0.4954
2OVO    418   0.5549   0.4981   0.4891   0.4933
1PGB    436   0.5385   0.4866   0.4878   0.4891
1HPT    440   0.5726   0.4879   0.4871   0.4928
1UOY    452   0.5414   0.4896   0.4872   0.4945
1UTG    548   0.5785   0.4780   0.4858   0.4896
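
The Gaussian profile quoted in the caption of Fig. 1(b) can be evaluated with the sketch below. The functional form, an offset Gaussian parametrized by its area S and width w, is a reading of the caption formula and should be treated as an assumption, as should the function name; the default parameter values are those given in the caption.

```python
import math


def gaussian_fit(x, y0=0.0053, xc=0.4910, w=0.0046, s=0.0008):
    """Offset Gaussian y0 + S / (w * sqrt(pi/2)) * exp(-2 (x - xc)^2 / w^2)."""
    amplitude = s / (w * math.sqrt(math.pi / 2.0))
    return y0 + amplitude * math.exp(-2.0 * (x - xc) ** 2 / w**2)
```

In this parametrization the standard deviation of the peak is w/2, so the fitted width w = 0.0046 would correspond to a standard deviation of about 0.0023 around the maximum at x_c.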

R ≈ 0.491 implies that a protein cannot be pictured as a chain of small, non-overlapping spheres, because in that case we would have R_c = (4π⟨r⟩³/3)/(4π⟨r⟩²⟨r⟩) = 1/3 ≈ 0.333. However, the result R ≈ 0.491 can be understood qualitatively by considering that a protein consists of tubes of radius ⟨r⟩: one tube represents the backbone of the protein, and further tubes represent its side chains. The total length of the tubes is l ∝ N. Using V ≈ π⟨r⟩²l and A ≈ 2π⟨r⟩l, we obtain V/A⟨r⟩ ≈ 1/2, which is consistent with our numerical result. The linear dependence of V and A on l ∝ N is supported by Fig. 1(a). It is worth noting that the linear correlation is independent of the choice of atomic radii. If other radii are used, the linear relation remains but the ratio is different. The ratio derived from the average over an ensemble of 715 PDB protein chains (selected with the Protein Sequence Culling Server [38]) using Richards' parameters [18] is V/A⟨r⟩ = 0.5589 ± 0.0114. Similarly, using the ProtOr radii [40], the result is V/A⟨r⟩ = 0.5288 ± 0.0113. The relation V/A⟨r⟩ ≈ 1/2 approximately holds in both cases.
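
The two limiting estimates in the paragraph above can be reproduced numerically. The sketch below computes V/(A r) for an isolated sphere, for an open tube whose end caps are ignored, and for a tube with flat end caps included; the function names are hypothetical, and the code only illustrates the geometric argument rather than modelling a real protein.

```python
import math


def ratio(volume, area, r_mean):
    """Dimensionless ratio R = V / (A * <r>)."""
    return volume / (area * r_mean)


def sphere_ratio(r):
    """R for a single sphere of radius r: 1/3 for any r."""
    return ratio(4.0 / 3.0 * math.pi * r**3, 4.0 * math.pi * r**2, r)


def tube_ratio(r, length):
    """R for a tube of radius r, lateral surface only: 1/2 for any r and length."""
    return ratio(math.pi * r**2 * length, 2.0 * math.pi * r * length, r)


def capped_tube_ratio(r, length):
    """R for a cylinder including its two flat end caps: length / (2 (length + r))."""
    volume = math.pi * r**2 * length
    area = 2.0 * math.pi * r * length + 2.0 * math.pi * r**2
    return ratio(volume, area, r)
```

With end caps included, the ratio stays slightly below 1/2 and approaches it as the tube becomes long compared with its radius, which is the regime l ∝ N assumed in the text.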

To clarify the relation between the ratio R ≈ 0.491 and the compactness of native structures, we computed R for artificial extended structures of protein molecules. The extended structures were obtained by setting all torsion angles of existing 3D structures from the PDB equal to 180°, using Swiss-PdbViewer 3.7 (SP5) (http://www.expasy.org/spdbv/). A typical compact PDB structure and an extended structure are shown in Fig. 1(c) and Fig. 1(d), respectively; the latter is similar to those obtained from the mechanical unfolding of proteins studied in Refs. [41, 42]. The histogram in red (dotted line) in Fig. 1(b) shows the distribution of R for 932 artificial structures. Its maximum is located at R ≈ 0.5120, which is higher than that of real protein structures. This result confirms that the value R ≈ 0.491 comes from the requirement for the formation of compact native conformations as a result of energy minimization.

Further, direct comparisons of volumes and surface areas for real and extended structures show that, from an extended structure to a real structure, there is a small change (increase or reduction) in volume, while there is usually a large increase in surface area, such that the ratio changes from 0.512 to 0.491. All of these indicate that the larger ratio of the extended structures is attributable to nonphysical geometrical properties, such as loose connections between monomers and an imbalance of the electrostatic interactions among monomers and of the interactions between monomers and water molecules. It should be noted that both real and artificial structures satisfy the requirements imposed by the Ramachandran plot, but only the former have protein-like properties. Thus, R ≈ 0.491 can serve as a useful criterion for selecting three-dimensional protein-like structures. In addition, we have found that the beta structures extracted from PDB protein structures have a smaller ratio, R = 0.4808, than the helices (R = 0.4859), as shown in Fig. 2(a). Although the ratios for individual secondary structures differ, a protein molecule as a whole is a self-organized geometric unit that blends various secondary structures to form a properly folded 3D structure. In contrast to secondary structures, tertiary and quaternary structures thus have the universal property R ≈ 0.491 regardless of their details. Defining the relative beta (helix) content of a protein as the number of amino acids belonging to beta strands (helices) divided by its total number of residues, we found that, as shown in Figs. 2(b) and 2(c), there is no correlation between R and the beta or helix content, as the correlation level of the linear fits is very low (0.005) in both cases.

FIG. 2. (a) Probability density function of R for whole protein molecules (28664 samples, Gaussian distribution centered at R = 0.4910), helix structures (extracted from 26040 samples, R = 0.4859), sheet structures (24537 samples, R = 0.4808), and other structures (25513 samples, R = 0.4811). (b) R as a function of the fraction of atoms in helix structures. The slope of the linear fit (black dashed line) is 0.0001, and the correlation level is 0.005. (c) R as a function of the fraction of atoms in sheet structures. The slope of the linear fit (black dashed line) is 0.0002, and the correlation level is 0.005.



B. V_s/A_s⟨r_s⟩ ratio for the solvent accessible volume V_s, surface area A_s, and average effective radius ⟨r_s⟩

In order to compare the distributions of R (with zero solvent radius) and R_s = V_s/A_s⟨r_s⟩ (with a solvent radius of 1.4 Å), we calculated the distributions of R and R_s for 28236 protein structures from the PDB. The histogram of R is not shown because it is similar to that of the larger set of 28664 protein structures [Fig. 1(b)]. The distribution of R_s [Fig. 1(e)] has a maximum at R_s ≈ 1.2402.



C. V/A⟨r⟩ ratio and hydrophobicity of amino acids

Consider a polypeptide chain consisting of a sequence of amino acids with different hydrophobicities. Hydrophobic condensation drives the polypeptide chain toward a conformation with lower free energy. This is achieved by burying hydrophobic residues in the interior and placing polar monomers on the surface, in contact with water. This process involves not only the regulation of the connections between monomers but also compensations of volume and surface area. According to the statistics shown in Fig. 3 for 723 protein structures (from the Protein Sequence Culling Server [38]), the ratio of hydrophilic to hydrophobic amino acids (H+/H-) of proteins on the Kyte-Doolittle scale [43] generally lies in a narrow range compared with the possible range of H+/H-, suggesting that the universality of R is probably a consequence of the composition of hydrophilic and hydrophobic amino acids in a protein. The linear fit for R as a function of H+/H- (lower panel of Fig. 3) gives a correlation level of 0.2. Since this level is notably lower than 0.5, there is no significant correlation between these two quantities. This is not unexpected because R varies in a very narrow interval. For the same reason, one can show that R does not correlate with the individual values of H+ and H-. Furthermore, folding simulations of ten polypeptides with fixed H+/H- and randomized sequences show that the averages of the ratios are ⟨R'⟩ = 0.548 ± 0.013 for the initial structures and ⟨R''⟩ = 0.490 ± 0.001 for the final structures after energy minimization. This implies that the ratio is a property not only of disordered proteins but also of random copolymers.

FIG. 3. Upper panel: distribution of the ratio of hydrophilic (H+) to hydrophobic (H-) amino acids in a molecule for 723 protein structures (from the Protein Sequence Culling Server [38]), based on the Kyte-Doolittle scale [43]. The Gaussian fit is centered at 0.594. Lower panel: R as a function of H+/H-. The slope of the linear fit (blue dashed line) is -0.0033, and the correlation level is 0.2. The inset shows the distribution of R for the 723 protein structures.
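
The slope and correlation level of a linear fit such as the one in Fig. 3 can be obtained as follows. The sketch assumes that "correlation level" means the Pearson correlation coefficient, which the text does not state explicitly; the function name is hypothetical, and any data passed to it in an example are illustrative rather than taken from the paper.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope, intercept, and Pearson r for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

Here `xs` would hold the H+/H- value of each protein and `ys` the corresponding R; a coefficient close to zero indicates that R carries essentially no linear dependence on the composition.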



D. V/A⟨r⟩ ratio for intrinsically disordered proteins

It is also of interest to study the ratio for intrinsically disordered proteins, which usually lead to misfolding [44, 45]. We calculated R for 38 protein structures [44] and found that the ratio is 0.4906 ± 0.005 (see Fig. 4), which is within the tolerance determined by the ensemble of 28664 PDB protein structures. Thus, the ratio holds once a polypeptide folds into a compact structure, regardless of its species.

FIG. 4. (a) Volume V as a function of surface area A for 38 disordered proteins [44]. (b) The histogram of R for the structures shown in (a). The Gaussian fit has its maximum at R ≈ 0.4906.



IV. SUMMARY AND CONCLUSIONS 

In summary, we have found a universal ratio of the van der Waals volume to the product of the surface area and the average atomic radius of folded structures, R = 0.491 ± 0.005, for native protein structures, including intrinsically disordered proteins. We have studied the connection between energy minimization and geometric conformation by monitoring the ratio R during folding simulations with the SMMP package [23, 24]. Our results suggest that R ≈ 0.491 is related to the global energy minimum of protein molecules. This result can be imposed as a rule when searching for native conformations in folding simulations that start from protein sequences. R ≈ 0.491 can also serve as a necessary condition for checking the validity of PDB data and for designing protein-like sequences.

It is well known that hydrophobic residues are buried in the core of proteins, and the van der Waals volume should therefore be proportional to the number of such residues. The van der Waals area should depend linearly on the number of hydrophilic residues, which have a tendency to reside on the protein surface. Thus, the universality of R is probably a consequence of the fact that the ratio of hydrophobic to hydrophilic amino acids in proteins is roughly constant.

Here we should emphasize that R ≈ 0.491 does not correspond to a unique conformation; rather, it confines the molecular conformations in a folding simulation from vast possibilities to a smaller space. It excludes improperly folded structures, which can be identified by this geometrical property, and thereby helps to reduce simulation time. One possible implementation of this property is in the calculation of the surface energy associated with the solvent accessible area. A preliminary test of V/A⟨r⟩, working as a filter, can be performed before the next update step in a simulation. Other independent factors can work together with it to define a conformation with a native-like structure.
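
A filter of the kind suggested above could look like the following minimal sketch. The function names, the default tolerance (taken here as the ±0.005 spread quoted for PDB structures), and the idea of calling the test before each update step are assumptions made for illustration, not a tested simulation protocol.

```python
R_NATIVE = 0.491     # universal ratio reported in the text
R_TOLERANCE = 0.005  # spread reported for PDB structures


def geometric_ratio(volume, area, mean_radius):
    """R = V / (A <r>) for one conformation (consistent length units)."""
    return volume / (area * mean_radius)


def passes_ratio_filter(volume, area, mean_radius,
                        center=R_NATIVE, tol=R_TOLERANCE):
    """True if the conformation's ratio lies within center +/- tol."""
    return abs(geometric_ratio(volume, area, mean_radius) - center) <= tol
```

In a folding run, V and A would come from a volume and surface-area calculation on the trial conformation, and conformations failing the test could be rejected or down-weighted; a wider tolerance may be appropriate early in a run, when R is still far from its final value.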